library(gganimate)
library(gifski)
library(tidyverse)
library(jpeg)
library(pacman)
library(grid)
library(broom)
library(gridExtra)

1 Regression basics

significant_regressors <- function(y, X, signif_level) {
  # returns a vector of significant regressor indices
}

remove_insignificant <- function(y, X, signif_level) {
  # removes insignificant regressors from X matrix
  # based on lm(y~X) and given significance level
}

algorithm <- function(m, n, nsteps = 4) {
  # Runs Steps 1-5 of the algorithm (or fewer, as given by nsteps)
  # Returns the number of significant regressors after nsteps
}

simulation <- function(n, nsteps) {
  # Applies the algorithm to randomly generated data
}

number_signif_histogram <- function(nsteps = 4) {
  # Draws histogram of no. significant regressors for
  # nsteps of algorithm
}
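The function bodies above are omitted; a minimal sketch of the two helper stubs, assuming they rely on base R `lm()` \(p\)-values, might look like this:

```r
# Hedged sketch of significant_regressors() and remove_insignificant()
# (assumed implementations; the original bodies are hidden above).
significant_regressors <- function(y, X, signif_level) {
  fit <- lm(y ~ X)
  pvals <- summary(fit)$coefficients[-1, 4]  # drop the intercept row
  which(pvals < signif_level)                # indices of significant columns
}

remove_insignificant <- function(y, X, signif_level) {
  keep <- significant_regressors(y, X, signif_level)
  X[, keep, drop = FALSE]                    # keep only the significant columns
}
```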

Task: Repeat this process many times and plot the distributions of the number of significant regressors in Step 5.

# how many regressors are significant at level 0.05 in step 4?
num_signif <- simulation(1, 4)
num_signif
## [1] 11
#number_signif_histogram()
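A self-contained sketch of what one run of `simulation()` presumably does (my assumed setup: pure-noise data, repeated screening at the 0.5 \(p\)-value threshold, then a final count at level 0.05):

```r
# Assumed Freedman-style setup: both X and y are pure noise.
set.seed(2024)
n <- 100; m <- 50                       # observations and noise regressors
X <- matrix(rnorm(n * m), nrow = n)
y <- rnorm(n)

# Repeated screening: keep only regressors with p-value < 0.5
for (step in 1:4) {
  pvals <- summary(lm(y ~ X))$coefficients[-1, 4]
  X <- X[, pvals < 0.5, drop = FALSE]
}

# How many survivors look "significant" at 0.05?
sum(summary(lm(y ~ X))$coefficients[-1, 4] < 0.05)
```

Even though every variable is noise, repeated screening typically leaves several regressors that appear significant at the 5% level.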

Task: Explain your findings.

Our findings are in line with the results of Freedman’s simulations (1983). First, our setup matches his assumptions: the number of randomly generated variables is (roughly) comparable to the number of data points.
Our \(R^2\) is high simply because we have put many regressors into the linear model, and a relatively large number of variables have passed the 0.5 \(p\)-value threshold from Step 2.

When we exclude the variables whose \(p\)-value is not below 0.5, we lose some of the model’s explained variance purely by the construction of \(R^2\) (even in our case, where the model contains nothing but random noise), so \(R^2\) decreases. More interestingly, the number and structure of the significant variables also change: more variables now reach higher statistical significance than before the \(p\)-value cutoff. This is likely because the regressors can be (even if weakly) correlated, so excluding one variable can raise the significance of another.
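The correlation effect can be seen in a tiny illustration (simulated data, not the assignment’s setup): with two strongly correlated regressors, dropping one sharply improves the other’s \(p\)-value.

```r
set.seed(7)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + 0.1 * rnorm(n)   # x2 is almost a copy of x1
y  <- x1 + rnorm(n)

# p-value of x1 with and without its near-duplicate in the model
p_full    <- summary(lm(y ~ x1 + x2))$coefficients["x1", 4]
p_reduced <- summary(lm(y ~ x1))$coefficients["x1", 4]
c(full = p_full, reduced = p_reduced)
```

Collinearity inflates the standard error of `x1` in the full model; removing `x2` makes `x1` far more significant.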

Similarly, when we repeat the whole procedure for Steps 3-5, \(R^2\) decreases and the number of variables shrinks, but their significance rises. However, this significance is biased by the previous selection steps, which were taken to adjust the whole model.

Lesson to take:

  1. In a linear regression model, fewer regressors sometimes mean more credible results. Adding too many regressors may produce a high \(R^2\) but also multicollinearity and misleading results.
  2. It is also important to determine the main purpose of the model. In an explanatory model, a variable’s \(p\)-value alone need not justify excluding it when we are confident that some economic theory connects the dependent and explanatory variables. If, however, we want to use the model for prediction, we may prefer to include as many informative variables as possible to obtain strong predictive power.
  3. Building a model by \(p\)-value cutting is not in line with the causal-inference process: it eventually produces a model with strong statistical significance, but such a model may not reflect the real-life relationships between the variables.

Task: Compare the distributions of the numbers of significant regressors.

Across the different steps of the algorithm, the number of significant regressors is higher with fewer steps, and in each case the distribution appears approximately normal.

# Distributions of the numbers of significant regressors in
# different steps
p1 <- number_signif_histogram(1)
p2 <- number_signif_histogram(2)
p3 <- number_signif_histogram(3)
p4 <- number_signif_histogram(4)
grid.arrange(p1, p2, p3, p4, ncol=2, nrow = 2)

2 Maximum likelihood

We have \(y_i \sim Bin(n_i, p_i)\) and \(\text{ln}(\frac{p_i}{1-p_i}) = \beta_0 + \beta_1x_{i1}+\dots+\beta_px_{ip}\).

Marginal probability mass function (pmf): \[P(Y_i=y_i) = {n_i \choose y_i} p_i^{y_i}(1-p_i)^{n_i-y_i} \] \[p_i = \frac{e^{\theta_i}}{1+e^{\theta_i}}, \quad \theta_i = \text{ln}\biggl(\frac{p_i}{1-p_i}\biggr) = \beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}\]

Joint probability mass function: \[f(y_i \space | \space n_i, p_i) = \prod_{i=1}^n{n_i \choose y_i}p_i^{y_i}(1-p_i)^{n_i-y_i}\] substitute \(p_i\) for \(\frac{e^{\theta_i}}{1+e^{\theta_i}}\):

\[ = \prod_{i=1}^n{n_i \choose y_i} \biggl[\frac{e^{\theta_i}}{1+e^{\theta_i}}\biggr]^{y_i}\biggl[1-\frac{e^{\theta_i}}{1+e^{\theta_i}}\biggr]^{n_i-y_i}\] Likelihood function:

\[L(\hat{\beta} \space | \space y_i, x_i, n_i) = \prod_{i=1}^n{n_i \choose y_i} \biggl[\frac{e^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}}}{1+e^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}}}\biggr]^{y_i}\biggl[1-\frac{e^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}}}{1+e^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}}}\biggr]^{n_i-y_i}\]

Log likelihood:

\[l(\hat{\beta}\space | \space y_i, x_i, n_i) = \text{ln}(L(\hat{\beta}\space | \space y_i, x_i, n_i))\] \[ = \sum_{i=1}^n \text{ln}\Biggl[{n_i \choose y_i} \biggl[\frac{\text{e}^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}}}{1+\text{e}^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}}}\biggr]^{y_i}\biggl[1-\frac{\text{e}^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}}}{1+\text{e}^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}}}\biggr]^{n_i-y_i}\Biggr]\] \[ = \sum_{i=1}^n\Biggl[\text{ln}{n_i \choose y_i} + y_i\Bigl(\text{ln}(e^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}}) - \text{ln}(1 + \text{e}^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}})\Bigr) + (n_i-y_i)\Bigl(\text{ln}(1) - \text{ln}(1 + \text{e}^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}})\Bigr)\Biggr]\] \[ = \sum_{i=1}^n\Biggl[\text{ln}{n_i \choose y_i} + y_i(\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}) - n_i\cdot\text{ln}(1 + \text{e}^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}})\Biggr]\] \[ = \sum_{i=1}^n\text{ln}{n_i \choose y_i} + y_i \theta_i - n_i \text{ln}(1+e^{\theta_i}) \]

Score function: differentiating the log likelihood with respect to \(\beta_j\) (with \(x_{i0} = 1\) for the intercept), the terms \(\text{ln}{n_i \choose y_i}\) vanish and we get \[S_j(\hat{\beta}) = \frac{\partial{}}{\partial{\beta_j}}l(\hat{\beta} \space | \space y_i, x_i, n_i) = \sum_{i=1}^n\biggl(y_i - \frac{n_i\,\text{e}^{\theta_i}}{1+\text{e}^{\theta_i}}\biggr)x_{ij}\] \[ = \sum_{i=1}^n\bigl(y_i - n_i p_i\bigr)x_{ij}\]
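The derivation can be sanity-checked numerically (simulated data; all names here are illustrative): minimizing the negative log likelihood derived above with `optim()` should reproduce the coefficients that `glm()` finds for the same binomial model.

```r
set.seed(42)
n_obs <- 200
x   <- rnorm(n_obs)
n_i <- rep(10, n_obs)                 # trials per observation
p_i <- plogis(-0.5 + 1.0 * x)         # true logistic probabilities
y   <- rbinom(n_obs, size = n_i, prob = p_i)

# Negative log likelihood from the derivation above (p = 1, one slope)
negloglik <- function(beta) {
  theta <- beta[1] + beta[2] * x
  -sum(lchoose(n_i, y) + y * theta - n_i * log1p(exp(theta)))
}

fit_optim <- optim(c(0, 0), negloglik, method = "BFGS")
fit_glm   <- glm(cbind(y, n_i - y) ~ x, family = binomial)
rbind(optim = fit_optim$par, glm = coef(fit_glm))  # the two rows should agree
```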


3 Bootstrap

confint_bootstrap <- function(nb, x, y, model) {
  # Constructs a 95% confidence interval based on the nonparametric bootstrap
  # Repeats this nb times as a simulation
  # Returns the proportion of cases in which the CI covers the true value
}

nb <- 5000
confidence <- confint_bootstrap(nb, x, y, model)
confidence
## [1] 0.908
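For a single run, the nonparametric bootstrap inside `confint_bootstrap()` might look like this (a sketch under assumed names; the simple linear model and the percentile interval are my assumptions):

```r
set.seed(123)
n <- 100
x <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n)       # true slope: 1.5

# Resample (x, y) pairs with replacement and refit each time
boot_slopes <- replicate(2000, {
  idx <- sample.int(n, replace = TRUE)
  coef(lm(y[idx] ~ x[idx]))[2]
})

# 95% percentile confidence interval for the slope
quantile(boot_slopes, c(0.025, 0.975))
```

`confint_bootstrap()` would wrap such a computation in an outer loop and report how often the resulting interval covers the true slope.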